Terminal Bench 2.0

A Harbor-native benchmark of 89 expert-crafted tasks measuring how well AI agents master real terminal environments across SWE, ML, security, and data science

Published: September 14, 2025

Keywords: Terminal Bench, terminal-bench 2.0, AI agent benchmark, terminal mastery, Harbor framework, Stanford, Laude, software engineering, machine learning, security, data science, coding agent evaluation, LLM agent leaderboard

Introduction

Most AI benchmarks test what models know. Terminal Bench tests what agents can do — inside a real terminal, with real tools, on real tasks.

Terminal Bench 2.0 is a Harbor-native benchmark of 89 high-quality tasks that measure AI agents’ ability to operate autonomously in terminal environments. Tasks span software engineering, machine learning, security, data science, scientific computing, and system administration — requiring agents to install software, debug code, crack hashes, train models, configure servers, and much more.

“Terminal-bench: benchmarks for AI agents in terminal environments. Harbor-native benchmarks to quantify agents’ terminal mastery.” — tbench.ai

graph LR
    A["Traditional Code Benchmarks<br/>(HumanEval, SWE-bench)<br/>Code generation focus"] --> B["Limited to<br/>code editing"]
    B --> C["Terminal Bench 2.0<br/>89 real terminal tasks<br/>Full system mastery"]
    C --> D["Measures true<br/>agent autonomy"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is Terminal Bench 2.0?

Terminal Bench 2.0 is the second major version of the terminal-bench benchmark suite — a Stanford x Laude collaboration — designed to evaluate AI agents’ ability to solve complex, real-world tasks inside terminal environments. Unlike benchmarks that test isolated coding ability, Terminal Bench drops agents into full Linux environments and asks them to accomplish goals that a skilled software engineer or system administrator would handle.

Each task provides:

  • A detailed natural language description of the goal
  • A Docker container with the required environment pre-configured
  • Automated verification scripts that check whether the task was completed correctly

Key Characteristics

| Feature | Details |
| --- | --- |
| Total tasks | 89 |
| Categories | Software engineering, ML, security, data science, scientific computing, system administration, debugging, and more |
| Difficulty levels | Easy, Medium, Hard |
| Evaluation | Harbor-native (via the Harbor framework) |
| Metric | % Resolved — percentage of the 89 tasks fully completed |
| Anti-contamination | Canary string embedded in benchmark data |
| Versions | 1.0 (legacy), 2.0 (live), 3.0 (in development), Science 1.0 (in development) |

How Evaluation Works

Terminal Bench uses the Harbor framework for evaluation. Agents are given access to a terminal environment and must complete each task autonomously. The evaluation command is straightforward:

harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5

Each task has automated verification that checks the agent’s work against precise success criteria — file contents, service availability, test outcomes, or computed results.
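
To make the Pass/Fail step concrete, here is a minimal sketch of what one of these verification scripts could look like, written in Python. It is purely illustrative and not taken from an actual Terminal Bench task: the results path, the JSON field, and the accuracy threshold are all assumptions made for the example.

# Hypothetical verifier for an illustrative task: the agent was asked to train a
# model and write its test accuracy to /app/results.json (path and schema assumed).
import json
import pathlib
import sys

RESULTS = pathlib.Path("/app/results.json")

def main() -> int:
    if not RESULTS.exists():
        print("FAIL: results file missing")
        return 1
    data = json.loads(RESULTS.read_text())
    accuracy = data.get("test_accuracy")
    if accuracy is None or accuracy < 0.90:
        print(f"FAIL: accuracy {accuracy} below the required threshold")
        return 1
    print("PASS")
    return 0

if __name__ == "__main__":
    sys.exit(main())

Real tasks typically layer several such checks — file contents, running services, test outcomes, computed results — and, because scoring is binary, a task only counts as resolved if the checks pass.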

graph TD
    A["Task Description<br/>(natural language)"] --> B["AI Agent"]
    C["Docker Container<br/>(pre-configured environment)"] --> B
    B --> D["Agent works in<br/>terminal autonomously"]
    D --> E["Automated Verification<br/>(success criteria checks)"]
    E --> F["Pass / Fail"]

    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#f39c12,color:#fff,stroke:#333
    style D fill:#8e44ad,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#2c3e50,color:#fff,stroke:#333

Who Built It?

Terminal Bench is a Stanford x Laude collaboration. The benchmark tasks were crafted by experts including:

  • Nicholas Carlini — Google DeepMind researcher, prolific task creator (security, software engineering, creative challenges)
  • Jan-Lucas Uslu — Task creator (security, hardware)
  • Junhong Shen — Task creator (system administration, data science)
  • Karl Krauth — Task creator (biology, scientific computing)
  • jeffreywpli, dwahdany — Task creators (data processing, ML)
  • And many other contributors from Stanford, Google, and the broader research community

| Resource | Link |
| --- | --- |
| Website | tbench.ai |
| Leaderboard | tbench.ai/leaderboard/terminal-bench/2.0 |
| Submission instructions | HuggingFace: harborframework/terminal-bench-2-leaderboard |
| Harbor Framework | harborframework.com |

What Skills Does It Test?

Terminal Bench 2.0 tests a remarkably diverse set of real-world terminal skills — far beyond what traditional coding benchmarks cover:

graph TD
    TB["Terminal Bench 2.0<br/>89 tasks"] --> SWE["Software Engineering<br/>Build, compile, debug"]
    TB --> ML["Machine Learning<br/>Train models, inference"]
    TB --> SEC["Security<br/>Crack hashes, find vulns"]
    TB --> DS["Data Science<br/>Process, query, analyze"]
    TB --> SCI["Scientific Computing<br/>Statistics, biology, physics"]
    TB --> SYS["System Administration<br/>Servers, VMs, configs"]

    style TB fill:#e74c3c,color:#fff,stroke:#333
    style SWE fill:#3498db,color:#fff,stroke:#333
    style ML fill:#27ae60,color:#fff,stroke:#333
    style SEC fill:#8e44ad,color:#fff,stroke:#333
    style DS fill:#f39c12,color:#fff,stroke:#333
    style SCI fill:#e67e22,color:#fff,stroke:#333
    style SYS fill:#6cc3d5,color:#fff,stroke:#333

| Category | Example Tasks | Difficulty |
| --- | --- | --- |
| Software engineering | Build POV-Ray from source, write a MIPS interpreter, implement pipeline parallelism in PyTorch | Easy–Hard |
| Machine learning | Train a FastText model on Yelp data, implement an LLM inference batching scheduler, recover a PyTorch model architecture | Medium–Hard |
| Security | Crack a 7z hash, exploit XSS filter bypasses, extract secrets from binaries, perform differential cryptanalysis | Medium–Hard |
| Data science | Reshard the C4 dataset, merge multi-source data, optimize SQL queries, set up HuggingFace model inference | Medium |
| Scientific computing | DNA assembly primer design, Raman spectrum fitting, MCMC sampling with Stan, adaptive rejection sampling | Medium–Hard |
| System administration | Configure a git webserver with auto-deploy, run Windows 3.11 in QEMU, set up mailing list servers, compile CompCert | Medium–Hard |
| Debugging | Fix an OCaml garbage collector, resolve C++ heap crashes, recover corrupted SQLite databases | Medium–Hard |

What Makes These Tasks Hard?

Unlike isolated coding challenges, Terminal Bench tasks require agents to:

  1. Navigate complex environments — install dependencies, configure build systems, manage services
  2. Chain multiple skills — a single task might require downloading, compiling, configuring, and verifying
  3. Handle real-world messiness — legacy code (COBOL modernization), obscure formats (G-code), corrupted data (WAL recovery)
  4. Demonstrate deep domain knowledge — from molecular biology (DNA assembly) to cryptography (FEAL attacks) to retro computing (Windows 3.11)

Current Leaderboard

The leaderboard below shows the top-performing agent–model combinations on Terminal Bench 2.0, ranked by % Resolved (percentage of 89 tasks completed successfully).

Source: Terminal Bench 2.0 Leaderboard (consulted July 2025). 120 total entries. Results verified by Terminal Bench team members.

| Rank | Agent | Model | Organization | % Resolved |
| --- | --- | --- | --- | --- |
| 1 | ForgeCode | Claude Opus 4.6 | ForgeCode / Anthropic | 81.8 ± 1.7 |
| 1 | ForgeCode | GPT-5.4 | ForgeCode / OpenAI | 81.8 ± 2.0 |
| 3 | TongAgents | Gemini 3.1 Pro | BIGAI / Google | 80.2 ± 2.6 |
| 4 | ForgeCode | Gemini 3.1 Pro | ForgeCode / Google | 78.4 ± 1.8 |
| 5 | SageAgent | GPT-5.3-Codex | OpenSage / OpenAI | 78.4 ± 2.2 |
| 6 | Droid | GPT-5.3-Codex | Factory / OpenAI | 77.3 ± 2.2 |
| 7 | Capy | Claude Opus 4.6 | Capy / Anthropic | 75.3 ± 2.4 |
| 8 | Simple Codex | GPT-5.3-Codex | OpenAI | 75.1 ± 2.4 |
| 9 | Terminus-KIRA | Gemini 3.1 Pro | KRAFTON AI / Google | 74.8 ± 2.6 |
| 10 | Terminus-KIRA | Claude Opus 4.6 | KRAFTON AI / Anthropic | 74.7 ± 2.6 |
| 11 | Mux | GPT-5.3-Codex | Coder / OpenAI | 74.6 ± 2.5 |
| 12 | MAYA-V2 | Claude Opus 4.6 | ADYA / Anthropic | 72.1 ± 2.2 |
| 13 | TongAgents | Claude Opus 4.6 | BIGAI / Anthropic | 71.9 ± 2.7 |
| 14 | Junie CLI | Multiple | JetBrains | 71.0 ± 2.9 |
| 15 | CodeBrain-1 | GPT-5.3-Codex | Feeling AI / OpenAI | 70.3 ± 2.6 |
| 16 | Droid | Claude Opus 4.6 | Factory / Anthropic | 69.9 ± 2.5 |
| 17 | Ante | Gemini 3 Pro | Antigma Labs / Google | 69.4 ± 2.1 |
| 18 | IndusAGI | GPT-5.3-Codex | SoloVpx / OpenAI | 69.1 ± 2.3 |
| 19 | Crux | Claude Opus 4.6 | Roam / Anthropic | 66.9 |
| 20 | Mux | Claude Opus 4.6 | Coder / Anthropic | 66.5 ± 2.5 |

Key takeaway: The top agents now solve over 80% of Terminal Bench 2.0 tasks, but the hardest ~20% — involving deep domain expertise in cryptography, biology, and complex systems — remain largely unsolved. The leaderboard features 120 entries from diverse organizations, with specialized agent frameworks (ForgeCode, Droid, TongAgents) consistently outperforming general-purpose CLI tools.

For the full, up-to-date leaderboard, visit the links in the next section.

Where to Explore the Benchmark

Dashboards and Leaderboards

| Resource | Description | Link |
| --- | --- | --- |
| Terminal Bench 2.0 Leaderboard | Full ranked leaderboard with all 120 entries, agents, and models | tbench.ai/leaderboard/terminal-bench/2.0 |
| Task Registry | Browse all 89 tasks with descriptions, categories, and difficulty | tbench.ai/benchmarks/terminal-bench-2 |
| Terminal Bench Home | Overview of all benchmark versions and upcoming releases | tbench.ai |

Submission and Evaluation

| Resource | Description | Link |
| --- | --- | --- |
| Submission Instructions | How to submit your agent to the leaderboard via HuggingFace | HF: harborframework/terminal-bench-2-leaderboard |
| Harbor Framework | The evaluation framework used to run Terminal Bench | harborframework.com |

Run the Benchmark

# Run Terminal Bench 2.0 with the Harbor CLI (install Harbor first)
harbor run -d terminal-bench@2.0 -a "your-agent" -m "your-model" -k 5

Understanding the Metrics

% Resolved

The primary metric. Each task is binary — either fully completed (verified by automated checks) or not. The score is the percentage of 89 tasks that the agent resolved successfully.

Confidence Intervals

Each leaderboard entry includes a confidence interval (± value) reflecting variance across evaluation runs. Smaller intervals indicate more consistent agent performance.
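
As a rough illustration of how these two numbers fit together, the snippet below computes a mean % Resolved and a standard-error-based interval from several evaluation runs of the same agent and model. It is only a sketch — the statistics behind the official leaderboard may differ, and the run scores here are invented.

# Illustrative only: aggregate per-run % Resolved scores into a mean and an
# approximate 95% interval. Not the official Harbor/leaderboard computation.
import statistics

def summarize_runs(run_scores: list[float]) -> tuple[float, float]:
    """Return (mean % Resolved, ~95% half-width from the standard error)."""
    mean = statistics.mean(run_scores)
    sem = statistics.stdev(run_scores) / len(run_scores) ** 0.5
    return mean, 1.96 * sem

# Example: five runs of the same agent/model pair, each scored out of 89 tasks.
runs = [100 * solved / 89 for solved in (70, 72, 69, 73, 71)]
mean, half_width = summarize_runs(runs)
print(f"{mean:.1f} ± {half_width:.1f}")  # prints "79.8 ± 1.6"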

Agent vs. Model

Terminal Bench uniquely separates the agent framework (e.g., ForgeCode, Droid, Claude Code) from the underlying model (e.g., Claude Opus 4.6, GPT-5.4). This reveals that:

  • The same model performs very differently across agent frameworks
  • Specialized agent frameworks consistently outperform general-purpose tools
  • The agent scaffolding matters as much as the model capability

graph LR
    A["Model Capability<br/>(reasoning, knowledge)"] --> C["Task Resolution"]
    B["Agent Framework<br/>(tool use, planning)"] --> C
    C --> D["% Resolved<br/>on Terminal Bench"]

    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#8e44ad,color:#fff,stroke:#333

Why Terminal Bench Matters

graph LR
    A["Code-only<br/>benchmarks"] --> B["Don't test<br/>system mastery"]
    B --> C["Terminal Bench<br/>fills the gap"]
    C --> D["Measures real<br/>agent autonomy"]

    A2["Isolated<br/>task evals"] --> B2["Miss multi-step<br/>complexity"]
    B2 --> C
    C --> D2["Drives agent<br/>framework innovation"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333

  1. Tests real autonomy — Agents must navigate full environments, not just generate code snippets
  2. Broad skill coverage — From biology to cryptography to system administration, no single skill suffices
  3. Separates agent from model — Reveals that scaffolding and tool use matter as much as raw model capability
  4. Practical relevance — Tasks mirror what engineers actually do in terminals every day
  5. Anti-contamination — Canary strings and automated verification prevent benchmark gaming
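
The canary idea in point 5 is worth spelling out: a unique marker string is embedded in the benchmark data, so if it ever appears in a model’s training corpus or output, contamination is the likely explanation. The sketch below shows the mechanism with a placeholder value — the actual Terminal Bench canary string is not reproduced here.

# Conceptual sketch of a canary-string check over a text corpus. The canary
# value below is a placeholder, not the real Terminal Bench canary.
import pathlib

CANARY = "TERMINAL-BENCH-CANARY-PLACEHOLDER-0000"

def corpus_contains_canary(corpus_dir: str) -> bool:
    """Return True if any .txt file under corpus_dir contains the canary string."""
    for path in pathlib.Path(corpus_dir).rglob("*.txt"):
        if CANARY in path.read_text(errors="ignore"):
            print(f"Canary found in {path} -- possible benchmark contamination")
            return True
    return False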

Video: Terminal Bench 2.0 Explained

Conclusion

Terminal Bench 2.0 sets a new standard for evaluating AI agents in real-world terminal environments:

  • 89 expert-crafted tasks spanning software engineering, ML, security, data science, scientific computing, and system administration
  • Built as a Stanford x Laude collaboration with tasks from leading researchers including Nicholas Carlini
  • Evaluated via the Harbor framework — reproducible, containerized, and open for submissions
  • The best agents solve ~82% of tasks, but the hardest challenges in cryptography, biology, and complex systems remain unsolved
  • The benchmark uniquely separates agent framework from model, revealing that scaffolding matters as much as raw capability

With Terminal Bench 3.0 and Terminal Bench Science already in development, this benchmark family is rapidly evolving to keep pace with agent capabilities — ensuring we have a rigorous measure of what AI agents can truly accomplish when given a terminal and a goal.
